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Abstract 

Additive isotonic regression attempts to determine the relationship between a multi-dimensional 
observation variable and a response, under the constraint that the estimate is the additive sum of 
univariate component effects that are monotonically increasing. In this article, we present a new 
method for such regression called LASSO Isotone (LISO). LISO adapts ideas from sparse linear 
modelling to additive isotonic regression. Thus, it is viable in many situations with high dimensional 
I I predictor variables, where selection of significant versus insignificant variables are required. We 

suggest an algorithm involving a modification of the backfitting algorithm CPAV. We give a numerical 
convergence result, and finally examine some of its properties through simulations. We also suggest 
some possible extensions that improve performance, and allow calculation to be carried out when 
the direction of the monotonicity is unknown. 

rj^ Keywords: Nonparametric regression; Isotonic regression; Lasso 

^ 1 Introduction 

0^ We often seek to uncover or describe the dependence of a response on a large number of covariates. In 

as described, for instance in Hastie and Tibshirani 1990| , is well known to be an useful generalisation. 



in 



many cases, parametric and in particular linear models may prove overly restrictive. Additive modelling, 



o 

Suppose we have n observations available of the pair (Xi^Yi), where 1^ S M is a response variable, 
> and A, = (A^ \ . . . , A^^^^) e is a vector of covariates. 

In additive modelling, we typically assume that the data is well approximated by a model of the form 

where e = (ei, . . . , e„) is a random error term, assumed independent of the covariates and identically 
distributed with mean zero. For every covariate k = 1, . . . ,p, each component fit fk is chosen from a 
space of univariate functions Tk- Usually, these spaces are constrained to be smooth in some suitable 
sense, and in fitting, we minimise the L2 norm of the error, 
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under the constraint that fk G J-k, for each fc = 1, . . . ,p. In the case that e is assumed to be normal, 
this can be directly justified as maximising the likelihood. 

Work on such methods of additive modelling have produced a profuse array of techniques and gen- 



eralisations. In particular, Bacchetti 1989 suggested the additive isotonic model. With the additive 
isotonic model, we are interested in tackling the problem of conducting regression under the restriction 
that the regression function is of a pre-specified monotonicity with respect to each covariate. (Isotonic 
means the functions are increasing, though decreasing can be accommodated easily by reversing the signs 
of covariates.) Such restrictions may be sensible whenever there is subject knowledge about the possible 
infiuence or relationship between predictor and response variables. A broad survey of the subject may 



be found in Barlow et al. 1972 . It turns out that in the univariate case, the Pool Adjacent Violators 



Algorithm, as first suggested in Ayer et al. 1955 , allows rapid calculation of a solution to the least 
squares problem using this restriction alone. By doing so, we retain only the ordinal information in the 
covariates, and hence obtain a result that is invariant under strictly monotone transformations of the 
data. In addition, the form of the regression, being simply a maximisation of the likelihood, means that 
apart from the monotonicity constraint, we do not put on any regularisation, or smoothing. 



Bacchetti 1989 built on this, by generalizing to multiple covariates. Here, the regression function 



is considered to be a sum of univariate functions of specified monotonicity. Fitting is conducted via the 
cyclic pool adjacent violators (CPAV) algorithm, in the style of a backfitting procedure built around 
PAVA — that is, cycling over the covariates, the partial residuals using the remaining covariates are 
repeatedly fitted to the current one, until convergence. Later theoretical discussion from |Mammen and| 



Yu 2007 outlined some positive properties of this procedure. 



Nevertheless, CPAV, like many types of additive modelling, can fail in the high dimensional case — 
for instance, once p > n. The particular problem is that the least squares criterion loses strictness of 
convexity when the number of covariates is large, since it becomes easy for allowed component fits in 
some covariates to combine in the training data so as to replicate component fits in unrelated covariates. 
It is hence impossible for the CPAV to distinguish between two radical different regression functions since 
they give the same fitted values on the training dataset. Some success might be achieved, though, if the 
solution sought is sparse, in the sense that most of the covariates have little or no effect on the response. 
If the identity of the significant variables were known, then, the CPAV could be conducted on a much 
smaller set of covariates. However, exhaustive search to identify this sparsity pattern would be rapidly 
prohibitive in terms of computational cost, scaling exponentially in the number of covariates. 

In the context of parametric linear regression, it has emerged recently that such sparse regression 
problems can be dealt with by use of a Ll-norm based penalty in the optimisation. This can resolve the 



identifiability problem and achieve good predictive accuracy. Tibshirani 1996 , Donoho 2006 , amongst 
others, have identified several significant empirical and theoretical results to support this 'LASSO' es- 



timator, while Efron et al. 2004 , Friedman et al. 2007 and others have invented fast algorithms for 
calculating both individual estimates and full LASSO solution paths. 

Generalisation of the LI penalisation principle to nonparametric regression can also lead to successful 



with additive modelling. For example, recent work on this subject includes SpAM Ravikumar et al. 



2007 , which describes the application of the grouped LASSO to general smoothers, and high dimensional 
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additive modelling with smoothness penalties Meier et al.||2009 which follows similar principles, using 
a spline basis. 

In this paper, we propose the Lasso-Isotone (LISO) estimator. By modifying the additive isotonic 
model to include a LASSO-style penalty on the total variation of component fits, we hope to conduct 
isotonic regression in the sparse additive setting. 



The LISO is similar to the degree case of the LASSO knot selection of Osborne et al. 1998 , which 



is also identical to the fused LASSO of Tibshirani et al. 2005 , if we replace the covariate matrix with 



ordered Haar wavelet bases, and do not consider coefficient differences for coefficients corresponding to 



different covariates. It is also similar to the univariate problem considered by Mammen and van de 



Geer 1997 . In contrast to each of these procedures, however, we allow the additional imposition of a 



monotonicity constraint, producing an algorithm similar in complexity to the CPAV. 

In section [2] we shall describe the LISO optimisation, and in section |3] we will discuss algorithms 
for computation for fairly large n and p. We will discuss the effect of the regularisation, and then in 
section [4] suggest some extensions. Finally, in section [5] we will explore its performance using some 
simulation studies. Proofs of theorems are left for the appendix. 

2 The LASSO-ISOtone Optimisation 

For now, let us assume without loss of generality that we are conducting regression constrained to 
monotonically increasing regression functions. Let us first define some terms. 

Let Y e M" be the response vector. Assume, subtracting by a constant intercept term if necessary, 
that YJLi = 0. X= . . . , A:(p)) G is the matrix of covariates. 

For a specified X, for fc = 1, . . . ,p, let J-k be the space of bounded, univariate, and monotonically 
increasing functions, that have expectation zero on the A:-th covariate. —J-k is then the same for mono- 
tonically decreasing functions. 



n 

f (xf'') = 0, and 3U,V s.t. Va < 6, C/ < /(a) < f{b) < V 



Additive isotonic models involve sums of functions from these spaces. It is simple to observe that 
each J^k is a convex half-space that is closed except at infinity, and so as a result the space of sums of 
these functions must also be convex and closed except at infinity. 

Definition 2.1. We define the Lasso-Isotone (LISO) solution for a particular value of tuning parameter 

A > as the minimiser f\—(fk\] , with fk \ G J-k^k, of the LISO loss 

V ' /fe=i 



L\{fi,.. fp) 



fc=i 



A^A(A) 



(2.1) 



fe=i 



Here A{fk) denotes the total variation of fk, which for fk G J-k can be calculated as 



A{g) = sup/fc(x) - inf fk{x) 
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As with the LASSO, the LISO objective function is the sum of a log-Ukehhood term and a penalty 
term. It is clear that the domain is convex and, considered in the space of allowed solutions, the objective 
itself is convex and bounded below. Indeed, outside a neighbourhood of the origin, both terms in the 
objective are increasing, so a bounded solution exists for all values of A. However, the objective may not 
be strictly convex, so this solution may not be unique. 

The log-likelihood term does not consider the values of fk except at observed values of each covariate, 
while the total variation penalty term, assuming monotonicity, only takes account of the upper and lower 
bounds of the covariate- wise regression function — indeed, for optimality, these bounds must be attained 
at the extremal observed values of the appropriate covariate, with the solution flat beyond this region. 
Thus, given any one minimiser to Lx, another fit with the same function values at observed covariate 
points, interpolating monotonically between them, will have the same value of L\, and so also be a LISO 
solution. This means that the we can equivalently consider optimisation in the finite dimensional space 
of fitted values /^(XC^)). 

For simplicity, we will represent found LISO solution components by the corresponding right continu- 
ous step function with knots only at each observation. For the remainder of this paper, we shall consider 
uniqueness and equivalence in terms of having equal values at the observed X^^K 

We have introduced a mean zero constraint on the fitted components for identifiability, since we 
can easily add a constant term to any component fit fk, and deduct it from another component, and 
still arrive at the same final regression function. However, we will show later that this constraint arises 
naturally in the univariate case, where even without it being explicitly applied, 

n n 



The total variation penalty shown here has been previously suggested for regression in Mammen and 



van de Geer 1997 , though in that case, the focus was on smoothing of univariate functions, without a 



monotonicity constraint. 

3 LISO Backfitting 

Considering the representation of the LISO in terms of step fimctions, the LISO optimisation for a 
given dataset can be viewed as ordinary LASSO optimisation for a linear model, constrained to positive 
coefficients, using an expanded design matrix X e K"xp("-i)^ where X = Each 
X(^) G k ^ 1, . . . ,p contains n — 1 step functions in the A:-th covariate, which form a basis 

for the vector /^(A'f'''^), and so isotonic functions in that covariate. The coefficients /3 optimised over 
then represent step sizes. 



Such a construction is suggested in Osborne et al. 1998 , amongst others. Under this re-parametrisation 



of the problem, existing LASSO algorithms for linear regression may be applied, with a modification to 
restrict solutions to non-negative values. In particular, the Least Angle Regression algorithm of |Efron| 



et al. 2004 is effective, since shortcuts exist for calculating the necessary correlations. 

On the other hand, the high dimensionality of X means that standard methods become very costly 
in higher dimensions, both in terms of required computation, but especially in terms of the storage 
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requirements associated with very large matrices. Hence, we must consider more specialised algorithms 
for such cases. One such approach involves backfitting, and is workable due to the simple form of the 
solution when restricted to a single covariate. 



3.1 Thresholded PAVA 

In the p = 1 case, it turns out that we have an exceptionally simple way to calculate the LISO estimate, 
which we will later use to establish a more general multivariate procedure. 

With no LISO penalty (i.e. A = 0) and a single covariate, the LISO optimisation is equivalent to the 
standard univariate isotonic regression problem. In this case, the loglikelihood residual sum of squares 
term is strictly convex, and so, as a strictly convex optimisation on a convex set, an unique solution exists. 



Trivially, the solution must also be bounded. In fact, there exists, as described in Barlow et al. 1972 



and attributed to |Ayer et al. [1955 , a fast algorithm for calculating the solution - the Pool Adjacent 



Violators Algorithm (PAVA). 



Hence, defining fx as the solution to optimisation (2.1) for A, we have /o — Jpava ■ The following 
theorems describe the solutions for other values of A: 

Theorem 1. For A < B, denote by fyA,<B the Winsorized PAVA estimate 



f>A,<B{x) ■■= < 



A iffpAVA{x)<A 
B iffpAVA{x)>B 
fpAVA{x) otherwise. 



Then if p — 1, there exist thresholds Ax < Bx for each value of X > such that the LASSO-Isotone 
solution is given by fx = f>Ax.<Bx- 

Theorem 2. In Theorem^ given fpAVA, the pair Ax, Bx (the optimal thresholding levels) are a piecewise 
linear, continuous and monotone (increasing for Ax, decreasing for Bx) function of A, for A > 0. 
Specifically, if 

n 

2X>Y,\fpAVA{X,)-Y\, (3.1) 
1=1 

then Ax = Bx = Y. 

Otherwise, Ax,Bx are the solutions to 

n 

Y,Upava{X,)^Bx)+ = \ (3.2) 

n 

Y^iAx - fpAVA{X,))+ = A. (3.3) 

i=l 

Corollary 3. Let tt be a permutation taking 1, . . . ,n to indices that put X in ascending order. Then if 

m 

A > max V (y,(,) - F) (3.4) 

1=1 

Ax=Bx= Y. 
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Remark 3.1. The LHS of (3.2| and (3.3) specify the amount by which each threshold changes the 
sum of the fitted values on the appropriate side of the mean. Hence, we see that '^"^i f\{Xi) = 
EiLi fpAVA{Xi) = X;r=i for aU A. 

In other words, if Y has mean zero, then the mean zero constraint on the fit arises naturally, without 
having to be externally applied. If Y does not have mean zero, the solution is simply a shifted version 
of the fit for Y ~ Y. This justifies deducting the mean of the response and dealing with it separately. 

Remark 3.2. The PAVA algorithm itself can accommodate observation weights, as well as tied values 
in the covariates. In terms of the LISO, working with unequal observation weights demands that we 
work with weighted residual sums of squares. This does not affect Theorem [ij but for equations (3.3) 



and (3.2 1, weights should be introduced in the summation. Tied values should be also dealt with by 
merging the relevant steps, and weighting them according to the number of data points at that covariate 
observation. 

3.2 Backfitting algorithm 

In general, however, simple thresholding fails to solve the LISO optimisation in higher dimensions, due 
to correlations between steps in different covariate component functions. We can, however, extend the 
ID algorithm to higher dimensions by applying it iteratively as a backfitting algorithm. 
In other words, we define LISO-backfitting by the following steps: 

Algorithm 1 LISO-Backfitting 
1: Set TO = 0. 

2: Initialise component fits (/i, . . . as identically 0, or as the estimate for a different value of A, 

storing these as the n x p marginal fitted values. 
3: repeat 

4: /"^(/l,...,/p). 
5: TO TO + 1. 

6: for k — 1 to p (or a random permutation) do 

7: Recalculate residuals y,; — X]fc=i (^^^''^^ , i — 1, . . . ,n. 

8: Refit conditional residual jr^ + fk (^X^''^^ |" using X^^) by PAVA, producing fk (aT^ for 
i — 1, . . . , n. 



9: Calculate thresholds Ax, B\ from A and fk by Theorem [i] 
10: Adjust component fit A.(xf ^) ^ Jk,>A,<BAx\''^)- 
11: end for 

12: until sufficient convergence is achieved, through considering /™ and f^' 

(k) 

13: Interpolate fk between the samples X^ . 



Theorem 4. For /™ = (/™, . . . , /™), the sequence of states resulting from the LISO-backfitting algo- 
rithm, La(/™) converges to its global minimum with probability 1. Specifically, if there exists an unique 
solution to /™ converges to it. 
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Remark 3.3. If there is no unique solution, the backfitting algorithm may not necessarily converge, though 
the LISO loss of each estimate will converge monotonically to the minimum. In addition, because the 
objective function is locally quadratic, as the change in the LISO loss converges to zero, the change in 
the estimate after each individual refitting cycle converges also to zero. 

Remark 3.4. Moreover, defining x'^^^ as the i-ih smallest value of X^''\ if a certain individual step in 
the final functional fit 

has a value of zero in all solutions to the LISO minimisation, then, after a finite number of steps, all 
results from the algorithm must take that step exactly to zero. 

This is because steps being estimated as zero in a LISO solution implies that the partial derivative 
of the LISO objective function L\ in the above individual step direction is greater than zero when 
evaluated at this solution. The partial derivatives are continuous, so as the algorithm converges, the 
partial derivatives associated with zero steps eventually be above and remain so. But then, this can 
only be the case following a thresholded PAVA calculation involving the covariate associated with that 
step if that single covariate optimisation takes the step exactly to zero. 

Convergence of the algorithm can be checked for by a variety of methods. One of the simplest is 
to note that due to the nature of the repeated optimisation, the LISO loss will always decrease in each 
step, and we will converge towards the minimum. Hence, one viable stopping rule would be to cease 
calculating when the LISO loss of the current solution drops by too small an amount. Alternatively, we 
can exploit Remark |3.3[ and monitor the change in the results in each cycle, stopping when this becomes 
small. 

3.3 Choice of regularisation parameter 

It will be always necessary to choose a tuning parameter A to facilitate appropriate fitting. As with 
the LASSO, too high a tuning parameter will shrink the fits towards zero. Indeed, consideration of 
Corollary [3] shows that, with F = 0, and tt^'^^ defined as a permutation that puts the fc-th covariate into 
ascending order, a choice of A greater than 

m 

^7r('»)(i) 

i=l 

will result in a zero fit in every thresholded PAVA step starting from zero, and hence a zero fit overall 
for the LISO. 

Conversely, too small a value of A will lead to improper fitting. This arises from two sources. Firstly, 
as with the LASSO, the noise term may flood the fit, as the level of thresholding is not sufficient to 
suppress correlations of the noise with the covariate step functions - the columns of X. Secondly, A has 
a role in terms of flt complexity, with a small value of A implying that the LISO, when restricted to the 
true covariates, would select more steps. This means a less sparse signal in the implied LASSO problem, 
so it becomes in turn more likely for selected columns of X to be correlated with columns belonging to 
irrelevant covariates, hence producing spurious fits in the other covariates. 



max 



7 



More precisely, in the noiseless case, if the true model function can be written exactly as the sum 
of step functions with, in the expanded design matrix X, corresponding column indices S, then correct 
recovery, given that LISO has fit non-zero fits to the true step functions, requires 



> (y - Xs [x^Xs) ' {X^Y - A) 



Xga Xs ( XgXs ] A 



(3.5) 
(3.6) 



This is the Irrepresentable Condition of the LASSO, as detailed in Zhao and Yu 2006 , Meinshausen 



and Biihlmann 2006 , and it may fail if S is too large. With the LISO, then, the particular choice of A 



itself influences the form the true covariates can take and so alters the criterion for Irrepresentability. 



> 






J J 'J T ) 
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Figure 3.1: Effects of changing the regularisation parameter in the noiseless case, n = 100, p = 200. 
Each line represents how an individual covariate's estimate changes as A varies, with the solid lines for 
the true covariates, while the dashed lines denote spurious fits on irrelevant variables. 



These effects are illustrated in Figure |3.1[ in which we have generated X, with n — 100, p=200, 
according to an uniform distribution, and produced Y as the sum of fc = 5 of the covariates. In other 
words, / is the sparse sum of linear functions. We give the full paths of fits in terms of, firstly, the total 
variation of fitted components A(/fc), and secondly the number of component steps in each covariate. 



{.:A(4?)^A(4^^.)} 



Of particular note is that, unlike the LASSO, even without noise, the size of the basis of step functions 
and the non-sparsity of the true signal means that as A — > 0, we do not converge to the true sparsity 
pattern. However, with higher A, the number of steps we choose diminishes rapidly, and as a result we 
can remove the spurious fits and simultaneously not mistakenly estimate the relevant covariates as zero. 

In Figure |3.2[ we add an independent normal noise component to Y, with variance chosen so that 
the signal to noise ratio, SNR = 5. In the new Total Variation plot, we see that the noise component 
has added additional noise fits in some of the irrelevant variables, and as in the LASSO these vanish for 
higher A. Since the spurious fits vanish before the true covariate components do, we see that recovery of 
the true sparsity pattern is still possible in this case. 

Now, in the above examples, we worked with the true sparsity pattern being assumed known. In real 
problems, we need to estimate the correct value of A directly from the data. To do this, with the goal of 
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Figure 3.2: Effects of changing the regularisation parameter in the noisy case, n — 100, p — 200, SNR = 
5. We show again in the first graph the total variation of each covariate estimate as A ahers, with sohd 
hnes for the truly important covariates, while the dashed lines denote spurious fits on irrelevant variables. 
The second graph shows the MSE from a 10-fold cross validation procedure with ±1 s.d. in dashes, as 
well as the true MSE on a new set of data as the thick line. 



recovering the correct sparsity pattern, is generally understood to be very difficult. (See e.g. Meinshausen 



and Biihlmann 2006 for some attempts.) However, as suggested in literature from Tibshirani 1996 



onwards, cross validation is effective for minimising predictive error, and is illustrated by second graph 
of Figure [3?2l Here, we calculate CV error from a 10-fold cross validation. We may then take the A that 
minimises the average mean squared error across the folds. If we desire a simpler model, we can, as is 
often suggested, take the largest A that achieves a CV value within 1 s.d. of the minimum. Examining 
the thick line for the true predictive MSE shows that such a procedure, while not perfect, can give good 
results. In minimising predictive error, however, we do still fit some irrelevant covariates as non-zero, a 



phenomenon previously observed with the LASSO in Leng et al. 2006 



Now, unlike a LARS-like approach, LISO Backfitting will only give us the solution for an individual 
choice of A. However, CV can still be practical, because coordinatewise minimisation can be very fast 



for sparse problems, something already observed for the normal LASSO Friedman et al. 2007 . We can 



further reduce the computational cost by noting that LISO solutions for similar values of A are likely to 
be similar, and hence use the result for one value of A as a start point for the calculation for a nearby 
value of tuning parameter. This is especially effective if we order the A values we need to calculate in 
decreasing order, since large A solutions are more sparse and so faster to calculate. 

4 Extensions and variations 

A variety of extensions and variations of the basic LISO procedure may be proposed, that may offer 
improvements in some circumstances. 



4.1 Bagged LISO 



Bagging [Breiman|1996 may be used with the LISO, by aggregating the results of applying the LISO to 
a number of bootstrap samples through any of a variety of methods. This usually succeeds in smoothing 
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the observation, especially if we use smoothed bagging Raviv and Intrator|1996 . However, this method 
is not reliably a great improvement in our empirical studies. Further, since the aggregated fit will produce 
a sparsity pattern involving a set of selected covariates that is the union of the selected covariates for 
each individual subsample calculation, we have that bagging will almost inevitably reduce the degree of 
sparsity in the fit, for any given degree of regularisation. 



4.2 Adaptive LISO 

A potential problem with the LISO is that it treats the constituent steps of each fit individually. In other 
words, there is no difference, in the eyes of the optimisation, between a fit that involves single step fits 
in a large number of covariates, and a single more complex fit in one covariate. As a result, the method 
may not achieve a great deal of sparsity in terms of covariates used, an issue we may want to rectify 
through making the algorithm in some sense recognise the natural grouping of steps in the step function 
basis. 



Many existing solutions to this issue, such as Huang et al. 2009 , involve explicitly or implicitly a 



Group LASSO Yuan and Lin 2006 calculation to produce this grouping effect. Incorporating this into 



LISO is possible, though it may produce a greatly increased computational burden. Instead, we shall 



apply ideas from Zou 2006 



Consider the following two stage procedure - we first conduct an ordinary LISO optimisation, arriving 
at an initial fit (/{*, . . . , /p). Then, we conduct a second LISO procedure, this time introducing covariate 
weights wi, . . . ,Wp based on the first fit, and use the results of this as the output. We define the Adaptive 
LISO as the implementation of this, with Wk = 1/A/", ioi k = 1, . . . ,p. 

Algorithm 2 Adaptive LISO 
1: Calculate initial fit using LISO. (For instance, using Algorithm [l]) 

2: Set w;fc = l/A(/0), forfeit,..., p. 

3: Calculate, using e.g. Algorithm [ij 



arg mm - 



Y 



k=l 



^ WkA{fk), with fk £ J^k, k ^ i, - ■ ■ ,P- 



fc=i 



4: If necessary, set f^ = f, and repeat from Step 2. 



The analogy to the adaptive LASSO is that we apply a relaxation of the shrinkage for covariates with 
large fits in the initial calculation, and strengthen the shrinkage for covariates with small fits - indeed, 
omitting entirely from consideration covariates initially fitted as zero. Usually, more than one reweighted 
calculation is not required. 

The Adaptive LISO encourages grouping of the underlying LASSO optimisation because large steps 
contribute to relaxation of other steps in the same covariate. In addition, it means that we in general 
require less regularisation of true fits in order to shrink irrelevant covariates to zero, through the concavity 
of the implied overall optimisation, to which we are essentially calculating a Local Linear Approximation 
Zou and Li||2008 . We will also always enhance sparsity through this procedure - indeed, the fact that 
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we reject straight away previously zero variables ensures the computational complexity of the method is 
usually at most equal to that of repeating the original LISO procedure for each iteration. 

It is, however, not clear what would be the best way to choose the tuning parameter introduced with 



each iteration of the process. We note that the discussants to Zou and Li 2008 have recommended a 



scheme based on individual prediction error minimising cross validation at every step, and our empirical 
studies suggest that this can pose significant improvements over the basic LISO. In our experiments, 
we also implement a variant of the adaptive procedure, LISO-SCAD, where instead the weights are 
calculated with an implied group-wise SCAD penalty. LISO-SCAD and LISO-Adaptive hence both fit 
under a broad group of possible LISO-LLA procedures. 



4.3 Sign discovery and total variation penalty 

Conventional isotonic regression focuses on the scenario where the monotonicity of the model function 
component in each covariate is known. However, this is not always realistic. Especially with large p, 
it may be the case that while we believe that the covariates contribute mostly in a monotonic way, we 
do not know, for at least some covariates, whether the covariate's component fit should be increasing, 
decreasing, or indeed even contribute non-monotonically. It is then perhaps reasonable to attempt to test 



for or estimate this monotonicity. This subject is dealt with by Bowman et al. 1998 , amongst others, 
with a focus on the univariate case. 

In higher dimensions, the problem is more difficult. One possible heuristic is to choose signs by a 
preliminary correlation check with the response. However, correlation is not invariant under general 
monotonic transformations, and examples exist where covariates have positive marginal effects, but, due 
to correlations between the covariates, turn out to have negative contributions in the final model. Now, 
with the LISO, it is trivial to use the same LISO-backfitting method for calculation with relaxation, or 
selective relaxation of the monotonicity condition. In this case, the relaxed form is just minimising the 
residual sum of squares, penalised by the total variation of the fitted step function. In other words, we 
find the minimiser, with /i, . . . , /p being univariate functions that have empirical mean zero and follow 
the specified combination of monotonicity constraints, of 



1 



Y - 



p 

E 

k=l 



p 



+ A> A(/fc), 



k=l 



(4.1) 



where A{fk) is calculated using 

/fe(maxXW)-/fe(minX«) 
A(minXW) -/fe(maxX«) 



Mfk) 



En 
i=2 



One way to implement this is to include reversed versions of non-monotonic covariates in the calcula- 
tion, (and hence fitting a monotonically decreasing function to them as well as a monotonically increasing 
function) and then combine the fits with their corresponding twins after the calculation is complete. 

Definition 4.1. Let {fk)'k=i be any set of right continuous step functions with knots in the fc-th covariate 

,p, define 



.(fc) 



if fk is monotonically increasing, 
if fk is monotonically decreasing, 
otherwise. 



at X 



(fc) 



< 



(k) 

< Xn^ , and for each fc, mean zero when evaluated at these knots. For fc = 1, 
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inductively as the right continuous step function with knot values 
and for j = 1, . . . , 71^ — 1, 



fk (xf) if fk (4+i) > A 



otherwise. 



with Cf ^ chosen so that Y.7=i fk {4''^) = 0- 
Similarly, define f^ as 



fk (4^=0 



<^2 ; 



and for j = 1, . . . , — 1, 



a: (41 + A (45.) - A H") if A (45.) < A (.f ) 

/jT (2;^'^') otherwise, 

with 6*2'^^ chosen so that X]r=i fk ('^i'^^) ~ 

Then, in the fully non-monotonic case, we have the following theorem: 
Theorem 5. Let (gk, hk)^^i be the minimiser, with gk G J-k,hk G —J-k,k 

1 



Mx{{9k,hk)l^,):= 



fe=i 



,...,p of 
A^(A(g,) + A(/i,)), 



(4.2) 



r/ien t/iere is a one-to-one correspondence between such minimisers and minimisers (fk) to (4.1 1 



This correspondence is given by the decomposition above, so that gk = hk — ff. , o.nd gk + hk = fk, 
for all k. 

An alternative implementation to using the above can be found by replacing the PAVA thresholding 
step in Algorithm 1 with a local thresholding style algorithm Mammen and van de Geer|1997l . This can 
be slower, however, due to the computational burden involved with dealing with a covariate that would 



be taken exactly to zero, compared to checking (3.1 1 in the former case 



Now, extending Theorem [Sj the Adaptive LISO can provide an alternative way of dealing with the 
problem of sign discovery. Starting with an initial non-monotonic LISO fit, fk, k = \, . . . ,p say, we can 
conduct a second non-monotonic LISO fit, with covariate weights as in the Adaptive LISO case - except 
that we treat the positive and negative component fits separately, with respect to the weights used. 

Let fk,fk^, k = 1, ... ,p be the decomposed version of the initial fit. The M\ approach will give 



us this decomposition directly, while we can apply the decomposition procedure from Definition 4.1 to 



obtain the appropriate decomposition with the second implementation, or indeed an initial fit found 
by any other method. Then, setting — 1/A(/^), ~ 1/A(/^) we find the LISO adaptive sign 
discovery solution to be simply g -\- h, where 'gk, ^hk G Tk k — \, . . . ,p are solutions of 
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arg mm - 

h-i , h„ 



fc=i 



Thus, as well as the effect seen in the adaptive LISO, where we have strengthened shrinkage of small 
function fits towards zero, functions with small negative or positive components in the initial fit will be 
shrunk towards an monotonically increasing or decreasing function respectively. 



5 Numerical results 

We will present a series of numerical examples designed to illustrate the effectiveness of the LISO in 
handling additive isotone problems. The experiments are calculated in R, using a standard desktop 



workstation. The full path solutions are found using a LISO modification to the Lars algorithm Efron 



et al. 2004 , while the larger comparison studies and fits are conducted using an implementation of the 



backfitting algorithm, with a logarithmic grid for the tuning parameter. 
5.1 Example LISO fits 

The following examples, conducted on single datasets, illustrate the performance of the algorithm. 
5.1.1 Boston Housing dataset 



The Boston Housing dataset, as detailed in Harrison and Rubinfeld 1978 , is a dataset often used in the 



literature to test estimators ~ see e.g. Hastie et al. 2003 . The dataset comprises of n — 506 observations 



of 13 covariates, plus one response variable, which is the median house prices at each observation location. 
The response is known to be censored at the value 50, while the covariates range from crime statistics 
to discrete variables like index of accessibility to highways. We use here the version included in the R 
MASS library, though we shall discard the indicator covariate chas, for ease of presentation. (Experiments 
suggest that this variable does not have a great effect on the response, in any case.) 



As suggested in Ravikumar et al. 2007 , we will test the selection accuracy of the model by adding 



U{0, 1) irrelevant variables. We add 28, so that our final p = 40. Since signs are not known, we will 



apply the sign discovery version of the LISO from Section 4.3 by first conducting a non-monotonic 
total variation fit, and then a weighted second fit. Tuning parameters are chosen by two 10-fold cross 
validations. 

Our selected model, finally, is 

y =a + /i(crim) + /2(nox) + /3(rm) + /4(dis) + /5(tax) + /6(ptratio) + ^(istat) + e. 

The remaining covariates are judged to have an insignificant effect on the response, with zero regres- 
sion fits, /a was found to be monotonically increasing, /i non-monotonic, and the remaining functions 
monotonically decreasing. The full results are shown in Figure |5.1[ 

We see in our experiments that for higher values of A, we successfully remove all the irrelevant 
variables, and end up with only a small number of selected variables to explain the response. However, 
in the one step procedure, the amount of shrinkage required is often large. With cross validation as a 
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crim zn indus nox 




Initial fit 
Second fit 



-1.5 -0.5 0.0 0.5 1.0 1.5 -1.5 -0.5 0.0 0.5 1.0 1.5 -1 1 -1.5 -0.5 0.0 0.5 1.0 1.5 

Figure 5.1: Fitted component functions on the Boston Housing dataset, for covariates originally present 
in the data plus four others. The dashed line shows the selected model after the first LISO step, while 
the solid black line shows the final result of the adaptive sign finding procedure. The single step fit 
produced additional non-zero fits in some of the artificial covariates, which are not shown, while the two 
step procedure fit all of them as zero. 

criterion, we do choose a A that involves some irrelevant variables as well, though these are in general 
small in magnitude. A second step greatly improves the model selection characteristics, as well as creating 
monotonicity which is often absent in the first step. 



It is interesting to contrast our fit with the findings from using SpAM Ravikumar et al. 2007 . Bearing 
in mind that our problem was in some sense more difficult, since we had 12 original covariates instead 
of 10 (rad and zn were not included in the SpAM study), and 28 artificial covariates instead of 20, our 
findings are largely similar. In addition to the covariates selected in SpAM, we add a fairly large effect 
from nox, and smaller effects in dis and tax. The most significant fits on rm and Istat are very similar, 
though the LISO fit is clearly less smooth. However, while almost all of the fits from SpAM exhibit 
non-monotonicity, the LISO fit we have found is mostly monotone, aside from the fit in crim. 

The non-monotonicity found in crim may seem problematic, given the interpretation of that covariate 
as a crime rate. While, nevertheless, this is a characteristic present in the conditional residuals, perhaps 
it would be reasonable to impose a monotonicity constraint instead. 
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5.1.2 Artificial dataset 

We are also interested in the success of LISO in correctly selecting variables for varying levels of n and 
p. We adopt the following setup - we generate pairs X £ W-^p, Y G M" by 

Xij ^ Uniform (0, 1) 

r. = 2(x«)%Xp)+sign(x(3)) 

with n = 1024, p ~ 1024, independent £i ^ N{0, 1). The covariates are then centred and standardised 
to have mean zero and variance 1, and Y is centred to have mean zero. 

For p' = 32,64,128,256,512,1024, n' = 5,10,15,... , we then take as X',Y' subsets of X,Y cor- 
responding to the first p' columns of X, and random samples without replacement of n' rows of X, Y. 
Hence we consider the problem of correctly finding 4 true variables, from amongst p' potential ones, 
based on n' observations. We quantify the success of LISO by looking at the proportion of 50 replica- 
tions where the algorithm, for at least one value of A, produces an estimate where the true covariates 
have at least one step while the other covariates are taken to zero. (We adopt this framework so as to 
reduce the additional noise from generating a complete new random dataset with each attempt.) 



X 



(3) 



2^{X<'">0} 




50 



100 



150 



Sample size 



Figure 5.2: Probabilities of correct sparsity recovery with 4 true nonlinear but monotonic covariates, 
SNR = 4. Each line shows how the recovery probability changes as the sample size n changes for a 
single value of p, taking values 2^, . . . , 2^°. 



Figure 5.2 gives these results. As we can see, as in a variety of LASSO-type algorithms Wainwright 



2006 , there is a sharp threshold between success and failure in recovery of sparsity patterns as a function 



of n. Moreover, as we increase p exponentially, the required number of observations n increases much 
more slowly, thus implying that p ^ n recovery is possible. 

Figure |5.3| gives an example of LISO fits arising from this simulation. The dashed lines shows the 
results of the LISO under the minimum regularisation required for correct sparsity recovery - note the 
high level of shrinkage required to shrink the other variables to zero. This shrinkage exhibits itself as 
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1 2 




-1.5 -0.5 0.0 0.5 1.0 1.5 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 



Figure 5.3: Example LISO covariate fits, for n — 180, p — 1024. The true component functfons are given 
by the thick hne, while the dashed line gives the raw LISO fit for the smallest amount of regularisation 
required to bring spurious fits in irrelevant covariates to zero. The solid black line shows a fit made by 
the adaptive liso, using tuning parameters found by cross validation. The fitted and true model functions 
for all 1020 remaining covariates are all constant zero. 

not only a thresholding on the ends of the component fits, which wc have seen in the univariate case, 
a , but also an additional loss of complexity in the middle parts of each component fit. We can avoid 
these shrinkages by using this initial result to perform the Adaptive LISO, in the solid black line, thus 
greatly improving the fit while still keeping the correct sparsity pattern recovery. As an added bonus, 
we get good results here with the Adaptive LISO even without using knowledge of the true process that 
generated the data. 

5.2 Comparison studies 

We shall now compare LISO to a range of other procedures in some varying contexts. Varying / between 
scenarios, consider generating pairs X, Y by, for each repetition, 

^ Uniform (—1, 1) , i = 1, . . . , n, j = 1, . . . ,p 
e^-iV(0,l), i = l,...,n 
= f{Xi) + asi, i = l,...,n. 

100 repetitions were done of each combination of model and noise level, with a chosen to give SNR — 

1.3 or 7, plus one further case where we have SNR — 3 but X is instead generated to have stronger 
correlation between the covariates, as a rescaled (to the range (—1,1)) version of ^{Z),Z ^ A^(0,E), 
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with = 2-l*-J'l. 

For comparison, we will compare the performance of LISO and LISO-LLA (both Adaptive and SCAD), 
calculated using the backfitting algorithm, to 



• Random Forests (RF), from Breiman 2001 . A tree based method using aggregation of trees 
generated using a large number of resamplings. 



• Multiple Adaptive Regression Splines (MARS), from Friedman 1991 , using the earth implemen- 
tation in R. A method using greedy forward/backward selection with a hockeystick shaped basis. 
We use a version restricted to additive model fitting. 



• Sparse Additive Models (SpAM), from Ravikumar et al. 2007 . A similar group LASSO based 
method using soft thresholding of component smoother fits. 



Sparsity Smoothness Penalty (SSP), from Meier et al. 20091. A group LASSO based method using 



two penalties - a sparsity penalty and an explicit smoothness penalty. 

For the choice of tuning parameter in all algorithms, we take the value that minimises the prediction 
error on a separate validation set of the same size as the training set. (Note that in the case of SSP, due 
to the slowness of finding two separate tuning parameters, we instead perform a small number of initial 
full validation runs for each scenario. We then plug in the averaged smoothness tuning parameter in all 
following runs, optimising for only the sparsity parameter.) 

We record both the mean value across runs of the MSE on predicting a new test set (generated without 
noise), and, in brackets, the mean relative MSE, defined for the k-th algorithm on each individual run as 

MSE'' 



^'S'^Relative 



minj=i_... J MSEi 



5.2.1 All components linear 



In this case, we have the response being just a scaled sum of A: = 5 randomly chosen covariates, plus a 
noise term, n — 200, p — hQ overall. In the test set, the variance of the response (and hence the MSE of 
a constant prediction) was approximately 1.7. 



Algorithm 


SNR = 7 


SNR = 3 


SNR = 1 


SNR = 3, Correlated 


LISO 


0.113 (4.70) 


0.186 (3.33) 


0.358 (2.41) 


0.203 (3.43) 


LISO-Adaptive 


0.070 (2.94) 


0.118 (2.18) 


0.242 (1.62) 


0.134 (2.27) 


LISO-SCAD 


0.113 (4.71) 


0.186 (3.33) 


0.437 (3.00) 


0.202 (3.41) 


SpAM 


0.082 (3.29) 


0.149 (2.57) 


0.346 (2.24) 


0.159 (2.59) 


SSP 


0.026 (1.00) 


0.061 (1.00) 


0.167 (1.02) 


0.065 (1.00) 


RF 


0.286 (11.97) 


0.319 (5.85) 


0.504 (3.36) 


0.361 (6.21) 


MARS 


0.146 (6.28) 


0.354 (6.72) 


1.027 (6.91) 


0.417 (7.19) 



Because of the sparsity and additivity in the data, all LASSO-like methods do better than RF, 
a pattern that continues in all of these simulation studies. Indeed, due to the random selection of 
covariates in the RF algorithm, the presence of spurious covariates seems to produce a phenomenon of 
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excess shrinkage, which can be clearly see in plots of fitted values versus response values. Using the 
scaling corrections provided in the R implementation improves things, but not to a great extent. MARS, 
similarly, has difficiilty in finding the correct variables. With such large p, the set of possible hockey 
stick bases MARS has to search through is very large, and hence the underlying greedy stepwise selection 
component of the algorithm is in general unsuccessful at handling this problem. 

Amongst the LASSO-like methods, perhaps unsurprisingly, the SSP method performs by far the 
best, owing to the large degree of smoothness in the true model function. LISO- Adaptive is second best, 
however, beating SpAM even though it does not have an internal smoothing effect. The basic LISO 
method itself underperforms, perhaps because it does not strongly enforce sparsity amongst the original 
covariates. 

Unexpectedly, LISO-SCAD performs fairly equivalently to the LISO itself in this and all following 
simulations. A likely explanation is that for sufficient regularisation to take place to take spurious 
covariates to zero, the penalty function is such that the solution lies mostly on the part of the penalty 
where it is identical to the original total variation penalty. 

The introduction of a moderate amount of correlation does not greatly affect the performance of any 
of the algorithms. 

5.2.2 Mixed powers 

In this case, the response has a more complex relation to the covariates: 



k=l 

fi{x) = sign(x + Ci) Ix + Ci| 
f2{x) = sign(x + C2) \x + C2I 
fsix) = sign(x + C3) \x + C3I 
f4{x) = sign(a; + C4) \x + C4I 
f5{x) = X + C5 



In this case, wc have again n = 200, p — 50. C'l, . . . ,0^ arc small shifts, randomly generated as 
Uniform{—l/4:, 1/4), and oi, . . . , as arc covariates randomly chosen without replacement. In the test set, 
the variance of the response was approximately 2.6. 
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Algorithm 


SNR - 7 


SNR = 3 


SNR = 1 


SNR = 3, Correlated 


LISO 


0.128 (1.49) 


0.230 (1.50) 


0.459 (1.41) 


0.255 (1.50) 


LIS 0- Adaptive 


0.088 (1.01) 


0.160 (1.00) 


0.352 (1.06) 


0.177 (1.01) 


LISO-SCAD 


0.128 (1.49) 


0.229 (1.49) 


0.587 (1.82) 


0.254 (1.50) 


SpAM 


0.157 (1.83) 


0.267 (1.75) 


0.539 (1.68) 


0.285 (1.69) 


SSP 


0.126 (1.47) 


0.226 (1.49) 


0.429 (1.33) 


0.252 (1.51) 


RF 


0.358 (4.21) 


0.450 (2.96) 


0.721 (2.26) 


0.495 (2.96) 


MARS 


0.319 (3.78) 


0.678 (4.54) 


1.936 (6.32) 


0.783 (4.71) 



With the new, non-hnear model function, the LISO and LISO-SCAD now perform equally as well as 
the SSP, while the adaptive LISO performs significantly better, being the best in almost all runs. All 
four methods outperform SpAM, and greatly outperform RF and MARS. 

In this case, the explanation is that for fractional powers, the component functions are relatively flat 
in the extremes of the covariate range, with most of the variation occuring in the middle of the range. 
SpAM and SSP are unable to capture the sharp transition point of the small root functions without 
introducing inappropriate variability at the ends of the fit, and hence both perform significantly worse 
than previously. The LISO based methods, however, do not explicitly smooth the fit and only threshold 
the extremes. Being thus adapted to this sort of function, they actually improve their performance in 
proportional terms relative to the variance of the test set. 

5.2.3 Mixed powers, large p 

In this scenario, our model is the same as before, save that we have many more spurious covariates, 
resulting in n = 200, p = 200. The variance of the test response is unchanged at approximately 2.6. 



Algorithm 


SNR = 7 


SNR = 3 


SNR = 1 


SNR = 3, Correlated 


LISO 


0.166 (1.89) 


0.283 (1.86) 


0.638 (1.78) 


0.286 (1.84) 


LISO-Adaptive 


0.090 (1.00) 


0.156 (1.00) 


0.384 (1.01) 


0.160 (1.01) 


LISO-SCAD 


0.169 (1.93) 


0.292 (1.91) 


0.935 (2.71) 


0.296 (1.90) 


SpAM 


0.201 (2.32) 


0.329 (2.17) 


0.779 (2.21) 


0.331 (2.14) 


SSP 


0.156 (1.78) 


0.274 (1.80) 


0.604 (1.73) 


0.274 (1.78) 


RF 


0.504 (5.86) 


0.588 (3.86) 


0.992 (2.84) 


0.593 (3.84) 


MARS 


0.805 (9.27) 


1.704 (11.49) 


4.707 (13.84) 


1.763 (11.60) 



In this case, LISO preserves its superiority. Due to the effect of high dimensionality, all algorithms 
see their performance decline - except the adaptive LISO, which has an increased MSE of less than 3% in 
the low noise case. This is due to the adaptive step, which retains a very sparse fit, picking the relevant 
variables even as the number of predictors grows. 

6 Discussion 

We have presented here a method of extending ideas from LASSO on linear models to the framework of 
non-parametric estimation of isotonic functions. We have found that in many contexts, it inherits the 
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behaviour of the LASSO in that it aUows sparse estimation in high dimensions. By using our backfitting 
procedure, we have also shown empirically that it can be very competitive with many current methods, 
both in terms of computational time and memory requirements, and in terms of predictive accuracy. The 
precise criteria that govern its success would require further work, and it would be interesting to see if 
similar LASSO-style oracle results apply. 

In addition, we find that an LL A/adaptive scheme is highly effective and efficient at improving the 
algorithm in a two step approach, producing sparser results and very high predictive accuracy. Further 
adaptations allow the LISO method to be used when monotonicity is assumed but the direction of the 
monotonicity is not know. To the authors' knowledge, this has not been attempted previously in this 
type of problem, and it would be interesting to see if LLA and similar concave penalty procedures can 
produce effective replacements for the group LASSO in the underlying calculation of non-parametric 
LASSO generalisations. 



fiix) = f>A,<Bix) := < 



A Appendix: Proofs of theorems 
Proof of Theorem [l] 

Our methodology is to show that adding boundary constraints to constrained or unconstrained isotonic 
regression problems result in unique solutions that are simply Winsorised PAVA estimates, and then 
demonstrate a method of constructing any LISO solution, in the univariate case, though boundary 
constraints. We prove first the following lemma, which provides an induction step in our eventual 
argument: 

Lemma 1. Suppose for A < B , X ^ {Xi, . . . , Xn), Y = (Yi, . . . , F„), Xi e M for all i, the Winsorized 
PAVA, 

A lffpAVA{x)<A 

B iffpAVA{x)>B 

fpAVA{x) otherwise. 
solves the boundary constrained isotonic regression problem, 

min \\Y — f{X)\\^ such that f monotone, A < f{x) < B, Vx. (A.l) 

Then for A < A' < B' < B, f2 = f>A'.<B' solves the further constrained isotonic regression problem 

mm||r- f{X)f such that f monotone. A' < f{x) < B' , Vx. (A.2) 

Further, this solution is unique, in terms of its fitted values f{Xi), . . . , f{X„). 

Proof of Lemma^ It suffices to prove the lemma for the case oi A < A' < B' — B , since the argument 
ioi A = A' < B' < B is identical, and we can proceed to the full Lemma by adding the top and bottom 
constraints one by one. 

Note that, specifying / through the fitted values /(Ai), . . . , /(A„), ||F — /(A)||^ is strictly convex 
when considered as a function of /, and the constraints give a convex feasible set. Hence, for any 
combination of A, B, solutions must exist and be unique at Ai, . . . , A„. 
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Therefore, let g be the solution of the optimisation (A. 2), for A < A' < B' = B. Suppose for 
contradiction that 3^/2 = f>A'.<B'- 

Let Uf,Ug be the points where the functions /2,.9 respectively exceed A' . 

Uf=mi{X, s.t. h{X,)>A'} 
Ug = M{Xi s.t. g{Xi) > A'} . 

Then we have two cases. 

(a) If < Ug, then consider a new function / where 



fi (x) if X <Uf 
g{x) if X > u-f. 



(A.3) 



/ would be an increasing function satisfying the conditions of ( A.l ). Because g ^ f2, and g and /2 are 



both equal to A' when restricted to {a; : x < u/}, it must be the case that g ^ f2 = fi when restricted 



to {a: : a: > Uf}. Therefore, / ^ /i. The residual sum of squares is then, applying (A. 2) optimality of g 







Xi<Uf 


X,>Uf 




^\\Y~g{X)f- J2 (Y^-^'f 


Xi<Uf 


Xi<uf 




^\\Y-MX)f- E (Y^-A'f 


Xi<Uf 


X,<Uf 






Xi<Uf 


X,>uf 


= \\Y~MX)\\\ 





Therefore, / is optimal for (A.l). This contradicts uniqueness and (A.l I optimality of / 



(b) If uy > Ug, then if we define / this time as 



/l (X) if X <Ug 

g{x) if x>Ug, 



(A.4) 



we obtain another increasing function satisfying the conditions of (A.l). As before, because we have 
assumed that g ^ /2, and yet g = /2 = A' for {a; : a; < Ug}, 5^/2 = /! when restricted to {a; : a: > Ug}. 
This means that / ^ /i, so unique optimality of /i versus / means that 



E {Y.-h{x.)f = \\Y-h{x)\\'- E m-/i(^.)r 

Xi>Ug XiKUg 

2 



<\\Y-f{x) E {Y,-Mx,)y 

Xi<Ug 

= E {Y,~g{X,)f 



Xi>Ug 
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In addition, because Uf > Ug, setting 5 = {g{ug) — A')/{g{ug) — fi{ug)) makes g, defined as 

A' ii X < Ug 

[l-5)g{x)+5Ji{x) if x>Ug, 



(A.5) 



an increasing function satisfying the conditions of (A. 2). By definition, g{ug) > A' and fi{ug) < A', so 
S G (0, 1), implying that 5 is a nontrivial convex combination of g and /i, when restricted to {x : x > Ug}. 
But by convexity, 



\\Y~g{X)\\'^ {Y,~A'f 

Xi<Ug 

< J2 (Y^-A'f 



^ (r,-(i-%(x,)-<5/i(xof 

Xi>Ug 

J2 iY.~giX,)f = \\Y-giX)f. 

Xi>Ug 



□ 



Xi<Ug 

This contradicts optimahty of g. 
Therefore, g = f2 is the unique solution to ( |A.2 ) 

Proof of Theorem^ We can prove Theorem [T] as a simple corollary. 

When A = 0, our objective function L\ is strictly convex and quadratic, and indeed is the same 
as the PAVA optimisation. Hence, an unique optimal solution exists, and is given by /o = fpAVA- 
Set ^0 = —oo^Bq = 00. Then /q = />Ao.<-Bo feasible for, and so, must also solve the constrained 



optimisation (A.ll, with constraints at infinity. 



For X > 0, L\ and the domain we maximise it in are both still convex, with Lx strictly convex and 
increasing away from the origin outside of a neighbourhood. Therefore, an unique bounded solution fx 
must exist. Set A\, Bx to be the upper and lower bounds of this solution. 

Ax = min fx (X, ) , Bx ^ max fx{Xi). 

i i 

Then consider the solution to the constrained isotonic least squares problem 



fx = argmin||y — f{X)\\ such that / monotone, Ax < f{x) < Bx, Vi. 



(A.6) 



^(/a) < Bx — Ax — A(/a), and since fx is feasible for 



A.6 



Y-fx{X) 



< 



Y-fxiX) 



Therefore, fx is optimal for (2.1 ). Hence, by uniqueness, fx = fx, so for suitable Ax, Bx it suffices to 



solve the bound constrained least squares optimisation (A.6), to find the LISO solution 



But by Lemma[l] the solution of (A.6 1 for > Ax < Bx is just the Winsorised PAVA solution /; 



so fx = fx = f: 



>Ax,<Bx 



>Ax,<Bxi 
□ 



Proof of Theorem [2] 

Proof For A = 0, /o = fpAVA, so choosing Aq = min(/p^vA(a:^)), Bq = Tna-x{fpAvAix)) is clearly a 



optimum for (2.l|) that satisfies (|3.2|) and (|3.3|). Assume therefore A > 0. 
Now, from 



Barlow et al. 



1972| , the unregularised PAVA solution fpAVA, considered as a right contin- 



uous step function, is an example of a regressogram. In other words, there exists a partition into disjoint 
intervals of M, Pi, ... , Pm, with the value of fpAVA{x) on each interval being the mean of the observed 
Y for X falling within the interval. 
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Ipava{x) = ™ ^^{^-g^^} = fpAVAiPj) for all x e P,- 
Using /o = /pAV/i, the LASSO criterion for the thresholded function f>A,<B then can be written as, 
Lx{f>A,<B) = 1\\y- f>A,<Bf + \{B - A) 



1 " 

= 2 ^ (^^* ~ •^o(^i))^-f{/o(x.)e[A,B]} + - ^i)'^hMX.)<A} + O^i ' ^^hfoiXi): 

i=l 

+ X{B - A) 

i=l 

1 " 

+ 2 E - + 2(/o(^i) - - /o(^0)) ^{/„(x.)>B} + ^(^ - ^) 

m n 

3=1 i=l 

rn n 

+ E E ((^ - MXiWoiX,) - muoix,)<A}) I{x.eP,} 
j=l i=l 

z=l 

m n 

+ J2{A - fo{Pj))+ J2 {iMX^) - n}) I{x.eP,} 



j=l i=l 



Y-fo 



2 1 

+ 2. 



((A - /o(X0)^ + ifoiX,) - B)l) + X{B - A). 

i=l 

We seek a minimum to this with A< B. Differentiating in A and B, and setting equal to zero, gives, 

n 

^(/o(X,)-i?)+ = A (A.7) 

i=l 

n 

^(A-7o(X,))+ = A. (A.8) 

Note that the left hand side in both cases is a piecewise linear, continuous and monotone (indeed, 
decreasing in B and increasing in A) function of the threshold, equalling zero for A = Ao,B = Bq. 
Further, since the PAVA is a regressogram, /o(-^i) = so 



n n 1 ^ 

Y^iUX,) - F)+ = ^(F - /o(X,))+ = - ^ \fpAVA{X,) - Y\. 
i—1 i—1 i=l 

We therefore have two cases. If 

n 

'i\<Y.\fpAVA{Xi)-Y\, 



i=l 
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then we can find solutions to (A.7) and (A.8) for wliicli A\ < Y < B\, producing a minimiser for 



(2.1) 



Otfierwise, (A.7) and (A.8 1 require that A > Y and B < Y. Hence, there are no solutions to our 



minimisation with A < B. 

For a solution on the boundary, we require A = B, so 



n 

L^{f>A.<B) = ^Y.^Y,-Af 



This is clearly minimised by Ax — B\ —Y . 

Corollary [3] follows from the fact that from properties of the PAVA, 



arg max 



max 



: fpAVA{XT,[,n)) < o| 



□ 



Proof of Theorem |4] 

Proof. If we go through the covariates in a pre-determined order, then we can apply a theorem proved 



Tseng 2001 . However, the proof is simplified in the case where we go through the variables in a 



random, independent order with each iteration, which we will show now. 

Because we are doing repeated minimisations, for all to, < ^a(/™)- Moreover, L\ is 

bounded below by 0. Therefore, L\{f"^) must converge monotonically in probability as to increases. 

Now, choose any > 0. We define A^^ as the event that there exists fee {1, . . . ,p} and : M — >• K 
such that 

the event that we can improve on the current fit by at least 5i by changing only one component estimate. 

Now, with the to + 1th iteration, we have an equal probability of picking any one of p covariates as 
the first fitted of our new backfitting cycle, and hence 

Prob {LxiD - Lx{r+^) > <5i) > Prob(A,„)/p. 

But L\{f"^) converges in probability, so Prob(Am) — > for all Si > 0. 

Lx is continuously differentiablc in the interior of (BJ-'k- The set / e ®J^k such that Lx{f) < Lx{f^) 
is closed, and compact. Therefore, the above implies that for any 77 > 0, > 0, there exists A/i such 
that for all to > Mi, the subdifferential of Lx at /™ contains a plane that is within 82 of zero with 
probability at least 1 — r?. 

Therefore, by considering sufficiently small values of (S2, this implies that for all -q > 6j,> 0, there 
exists M2 such that for all to > M2, there exists with probability at least 1 — ?7, another sum of functions 
Z'" S ®J^k satistifying ||/™ — /™|| < ^3, at which Lx contains the zero plane in its subdifferential. 

Since Lx is convex, J™ is a global minimiser of Lx- Taking rj^5^ to zero, we see by continuity of Lx 
that Lxif™') converges in probability to the global minimum. 

In addition, this implies that if the global minimum is unique, and so equals Z™ Vto, must 
converge to it. 

□ 
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Proof of Theorem [5] 

Proof. For simplicity, consider only the univariate case. Assume further for simplicity of notation, by 
permuting the observations if neccessary, that the covariate is sorted in that Xi < . . . < X„, for 
I = 1, . . . , n — 1. 

Let G be the set of pairs of monotonically increasing and monotonically decreasing functions g, h, 
with mean zero, with the constraint that in each interval at most one function of the two changes. Hence, 
with ordered covariate observations, either g{Xi^i) = g{Xi) or h{Xi^i) = h{Xi). 

Observe that for any right continuous mean zero step function /, any pair of monotonically increasing 
and monotonically decreasing mean zero functions g, h satisfying g + h = f can only minimise Ag + Ah 
among such functions if (5, h) G Q. Otherwise, if the constraint is broken for i, defining 

g{x) = g(x) - min (.g(X,+i) - g{Xi), h{Xi) - h{Xi+i)) I{x>x,+i} 
h{x) = h{x) +mm{g{X,+i) - g{Xi),h{X,) - h{Xi+i)) I^^^Xi+i}, 

gives g + h = g + h, and A(g) + A(h) = A{g) + A{h) - 2mm{g{X,+i) - g{Xi),h{Xi) - h{X,+i)). 



Therefore solutions to (4.2) must lie within Q. 

Now, it is trivial to see that the decomposition in Definition |4. 1| maps right continuous step functions 
with mean zero to pairs in Q. Two such step functions have the same decomposition if and only if they 
are equal at all knot points, and so, are for our purposes equivalent. Summation is an inverse with these 
spaces, since any such step function can be constructed as the sum of its decomposed functions, and we 
can produce any element in Q by from a right continuous mean zero step function by decomposing its 



sum. Thus, Definition 4.1 gives us a one to one map between the feasible set of (4.1) and a set that 



contains all solutions of (4.2) 



Let / be any right continuous step function with mean zero, and let (/+ , / ) e be its decomposition. 
By construction, 

n-l 



A(/+) + Air) - - /(^0)^{/(x.+,)-/(x.)>o} + (.fix,) - /(X4+i))/{/(x,+,)-/(x.)<o} 

4 = 1 

n-1 



Hence, 

Ma (/+,r) = ^ 11^ - w - r wf + A(A(/+) + A(r )) 

^^\\Y-fiX)f + \Aif) = L, if). 

Therefore, minimising L\ means the corresponding decomposed functions must minimise M\, and vice 
versa. Hence, the decomposition/summation transformations give a one to one map between solutions 



to (|42|) and ( |2.l[ ). 

The case of multiple covariates can be dealt with by applying the some argument to each covariate 
in turn. □ 
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